COVID-19 infection and transmission includes complex sequence diversity

SARS-CoV-2 whole genome sequencing has played an important role in documenting the emergence of polymorphisms in the viral genome and its continuing evolution during the COVID-19 pandemic. Here we present data from over 360 patients to characterize the complex sequence diversity of individual infections identified during multiple variant surges (e.g., Alpha and Delta). Across our survey, we observed significantly increasing SARS-CoV-2 sequence diversity during the pandemic and frequent occurrence of multiple biallelic sequence polymorphisms in all infections. This sequence polymorphism shows that SARS-CoV-2 infections are heterogeneous mixtures. Convention for reporting microbial pathogens guides investigators to report a majority consensus sequence. In our study, we found that this approach would under-report sequence variation in all samples tested. As we find that this sequence heterogeneity is efficiently transmitted from donors to recipients, our findings illustrate that infection complexity must be monitored and reported more completely to understand SARS-CoV-2 infection and transmission dynamics. Many of the nucleotide changes that would not be reported in a majority consensus sequence have now been observed as lineage defining SNPs in Omicron BA.1 and/or BA.2 variants. This suggests that minority alleles in earlier SARS-CoV-2 infections may play an important role in the continuing evolution of new variants of concern.


Introduction
Early investigation into the outbreak of the novel pneumonia of unknown cause, from the city of Wuhan in late December 2019, quickly identified the genomic sequence [1] associated with a coronavirus, closely related to the SARS-CoV [2,3]. The virus, SARS-CoV-2, is now understood to be the causative agent of COVID-19. Sequence variation across the~30kb positivestrand RNA SARS-CoV-2 genome was reported in the earliest studies, including single nucleotide polymorphisms (SNPs), insertions and deletions (indels) [1,4,5]. Significant efforts have since focused on characterizing sequence variation and documenting the evolution of SARS-CoV-2 strains during the pandemic through over 1.8 million sequences submitted from 172 countries as of April 2021 [6]. Extensive effort [7][8][9][10] has been and continues to be performed to define and track emergence of lineages across the ongoing evolution of SARS-CoV-2 populations around the world. This effort has captured and presented evidence of the continuous emergence of the major SARS-CoV-2 lineages sharing sequence variations across the global landscape during the pandemic. Many notable lineages including, B.1.1.7 (Alpha), B.1.351 (Beta), P.1 (Gamma), B.1.617.2 (Delta) have been associated with important surges in COVID-19 transmission and on-going phases of the pandemic [8,11]. With the surge of each new variant, concern naturally arises (and evidence accumulates) about strain-specific transmission and virulence, as well as the effectiveness of COVID-19 diagnostics, therapeutics, and vaccines.
From earlier studies of the SARS-CoV-1 virus, it was suggested that the error-correcting capacity of the RNA-dependent RNA polymerase (RdRP) conferred by nonstructural protein 14 (NSP14) [12,13] would contribute to replication error rates significantly lower than other RNA viruses (e.g. HIV-1 and influenza virus) [14,15]. Studies comparing SARS-CoV-1 and SARS-CoV-2 estimate mutation rates that range from 1-3 × 10 −3 substitutions/site/year [7,[16][17][18], respectively (extrapolating to 30-90 mutations/genome/year) are consistent with this prediction. Data provided through the Global Initiative on Sharing All Influenza Data (GISAID), and NextStrain provides a continuously updating report on SARS-CoV-2 lineage classification and illustrates that an apparently low mutation rate is no guarantee that the virus will exhibit limited capacity for variation, particularly once it has become so widely dispersed through millions of daily infections [5,19,20]. A further potential source of sequence complexity includes involvement of more than one variant in an individual infection.
Data analyzed for lineage tracking relies on extensive curation and validation of SARS-CoV-2 genomic data. This includes minimization and resolution of ambiguous sequence data. In fact, most sequence data used to monitor the COVID-19 pandemic distills the whole genome sequence generated from an infection sample, to a single consensus sequence (the majority nucleotide present at each genomic position) [21][22][23][24] to national and international databases [25,26]. Out of the wide range of SARS-CoV-2 genomics studies, sequence variation within infected individuals has been described in relatively few studies as intra-host single nucleotide variants (iSNVs) [27][28][29][30][31][32][33][34][35]. Outcomes of these studies appear to be mixed in their assessments of individual infection diversity and transmission outcomes but inevitably reveal and acknowledge that these variations within a single infection exist.
Here we evaluate SARS-CoV-2 whole genome data from a sample of patients, administrative and medical staff experiencing COVID-19 in the VA Northeast Ohio Healthcare System. We have applied stringent inclusion criteria based on SARS-CoV-2 genome coverage and read-depth to normalize comparisons across sequences derived independently from individual patient isolates as well as patients who have had direct prolonged contact with other infected individuals. We examine multiple perspectives, including details of individual infection diversity and transmission outcomes which are critical for the successful management of the COVID-19 pandemic. The findings reveal details about SARS-CoV-2 infection and transmission complexity that are significantly under-reported.

SARS-CoV-2 genome sequence coverage and detection of genetic variation
Full genome sequences from 364 SARS-CoV-2 infections (254 from present study; 110 from international data repositories) were analyzed in comparison to the Wuhan reference genome (GenBank NC_045512). A total of 254 samples from patients (N = 179) and healthcare personnel (N = 75) with COVID-19 were collected to (a) perform high-resolution contact tracing of COVID-19 infections and (b) gain insight into SARS-CoV-2 transmission dynamics between donor and recipient individuals [33][34][35]. Because of the infection control and outbreak monitoring nature of this work, the individuals represented a wide demographic range, including gender, age, race, and health status. Of 175 patients with available clinical information, the mean age was 61.4 years (range, 24 to 98), 31 (18%) were asymptomatic, 58 (33%) were hospitalized for COVID-19 or non-COVID-19 conditions, 32 (18%) had severe COVID-19, and 11 (6%) died due to complications of COVID-19. Twenty-four (14%) patients with COVID-19 were moderately or severely immunocompromised, including 17 with active treatment for malignancy with chemotherapy or severely immunosuppressive medications, 2 with advanced or untreated HIV infection, 1 with an organ transplant requiring immunosuppressive therapy, and 4 with other chronic medical conditions requiring immunosuppressive therapy. All patient samples have been analyzed as part of an effort to assess the extent of sequence variability within SARS-CoV-2 infections across different surges associated with variants of interest and concern from June 2020 to October 2021. An additional 110 samples from the Sequence Read Archive (SRA) have been analyzed for assessment of diversity outside the local region.
Each sample was initially evaluated by RT-PCR to approximate levels of infection that would enable high-resolution evaluation of individual infection complexities. The Ct values for samples that resulted in SARS-CoV-2 full genome coverage (�80% of the genome covered at �100X read-depth) were significantly lower (Average Ct value of 27.96 [n = 138]), implying higher viral RNA copy number than samples that did not meet this coverage threshold (Average Ct value of 35.85 [n = 103]) (Fig 1A; Mann-Whitney U = 1,971.5, P<0.0001). Coverage across the SARS-CoV-2 genome for samples with sequence generated above (black line) and below (gray line) the coverage threshold is summarized in Fig 1B and direct correlation between Ct and average coverage for each individual sample is provided in Fig 1C. Among the 140 (two samples did not have RT-PCR data) sequences meeting our technical thresholds, the average coverage was 1,081X (range 223 to 2,235X) with an average of 279,408 uniquely mapped reads per sample (min = 63,296, max = 2,680,239). Further examination of our data showed that the ability to detect SNPs was not influenced by Ct or average coverage if a minimum coverage was reached (Fig 1D and 1E). Samples that passed our inclusion criteria, �80% of the genome covered at �100X read depth, showed comparable number of SNPs defined by alternate allele frequency (AAF) > 5% (filled black circles) or SNPs defined by AAF > 50% (filled diamonds), regardless of Ct value. As expected, fewer SNPs were detected in samples that did not meet our inclusion (open circles and diamonds). With read depth of �100X, we observed consistency in detecting expected sequence variations within our sampled population and across genomic sequences reported into the international databases. From these analyses we established confidence in detecting minor alleles at frequencies �5% (whether reference or alternate allele). For the 114 samples not meeting our criteria for inclusion, the average coverage was 71X (range 0 to 1,281X) and these samples were removed from subsequent analyses. A comparison of the genome sequence data across the total sampling period of our survey showed a significant increase in the number SNPs detected over time ( Fig  1F and Table 1) with average coverage remaining consistent through this period; approximately doubling (30 to 65 AAF �5%) or tripling (10 to 35 AAF �50%) from April 2020 to August 2021. The two samples demonstrating substantially higher SNP counts (Fig 1D-1F red and green arrows) will be covered in their epidemiological context below.

Consensus sequences and SARS-CoV-2 strain polymorphism
Enumeration of SNPs across the SARS-CoV-2 genes and a full listing of all SNPs observed in our study are presented in S1 Table (SNPs observed in �10 infections) and S2 Table (all SNPs), respectively. Alignment of the 140 SARS-CoV-2 consensus sequences (S3 Table) that met the coverage thresholds, revealed a total of 1,096 SNV positions. Among these polymorphisms, 406 SNPs were observed in multiple infection samples.
The variant positions across the SARS-CoV-2 genomic sequences detected with allele frequencies >5% reveal extensive biallelic polymorphism throughout our results (Figs 2 and 3). We first examine these data by assessing alternate vs reference allele frequency proportions

Fig 1. Assessment SARS-CoV-2 Amplicon Library Preparation and Genome
Coverage. a RT-PCR cycle thresholds predicting SARS-CoV-2 sequence read quality. Comparison of RT-PCR cycle thresholds for samples according to technical breadth and depth thresholds ((�80% of SARS-CoV-2 genome covered at �100X read-depth; met thresholds, n = 140; did not meet thresholds n = 114). b SARS-CoV-2 full genome sequence depth comparisons. Among 140 sequences meeting our technical thresholds, the average coverage was 1,081 X (range 223 to 2,235 X); for the 114 sequences not meeting our technical thresholds, the average coverage was 71 X (range 0 to 1,281 X). Among all samples, average uniquely mapped read count was 292,430 reads (min = 21, max = 2,680,239). Among samples that pass >80% 100X coverage, the average uniquely mapped read count was 279,408 reads (min = 63,296, max = 2,680,239). Among samples that did not pass our cutoff, the average uniquely mapped read count was 175,895 reads (min = 21, max = 1,283,414). c Correlation between RT-PCR Ct value and average viral genome coverage; black circles identify samples that met the coverage and read-depth thresholds, and gray circles identify samples that did not meet these thresholds. d Correlation between detection of alternate alleles and RT-PCR Ct value where alternate allele frequency (AAF) was >5% (black circles >80%100x and open gray circles <80%100x), or where AAF was >50% (blue diamonds >80% 100x, open blue diamonds <80% 100x). e Correlation between detection of alternate alleles and average SARS-CoV-2 genome coverage where alternate allele frequency (AAF) was >5% (black circles >80%100x and open gray circles <80%100x), or where AAF was >50% (blue diamonds >80% 100x, open blue diamonds <80% 100x). f Correlation between the number of alternate alleles identified in the 140 >80% 100x samples and their collection dates-AAF >5% (black circles), or AAF >50% (blue diamonds). Significant correlation was observed between an increase in AAFs during the sample collection time-period (AAF >5%, Pearson R = 0.76112; AAF >50%, Pearson R = 0.89253; both P-values <0.0001). Red arrows in d-e identify samples that carried significantly more sequence variation compared to expectation (Sample 106-110 SNPs; Sample 110-132 SNPs). Green arrows in f link the position of the two samples with additional sequence variation at the AAF >50% and AAF >5% thresholds.

SARS-CoV-2 infection and transmission complexity
(AAF:RAF) for the 75 SNV positions that were shared most commonly (in �10 samples) (Fig  2). The red dashed line in the figure at 50% allele frequency is the commonly accepted cutoff for variant detection indicating major allele. Only those alternate allelic SNPs with frequencies above 50% would be reported as part of the consensus sequence characterizing the virus present in the infection sample. Following this convention, 30 alternate allelic positions observed in our surveyed population would not have been reported in any of the infection-specific consensus sequences, totaling 2,865 SNP events. Among the 45 SNPs where AAFs did rise above the 50% threshold, the alternate allele would be reported in 1,143 occurrences (AAF >50%) and not reported in 1,553 (AAF 5-50%). Following the consensus sequence reporting convention across the 75 positions monitored, most of the SARS-CoV-2 variations (4,418 SNP out of 5,561, 79%) with alternate allele frequencies between 5-50% would not have been reported across the infections studied. These findings shown previously in Fig 1D-1F, predicting lower levels of polymorphism at a 50% vs 5% threshold, are consistent with this analysis. Our observations of biallelic variant positions are consistent with those from other studies describing subconsensus SNPs or iSNVs [27][28][29][30][31][32][33][34][35]. To determine how often biallelic variations were included in SARS-CoV-2 sequences submitted to international repositories, we selected and analyzed 110 additional COVID-19 infections (S4 Table) reported from 18 different laboratories or consortia with raw sequence data available in the sequence read archive (SRA; total whole genome SARS-CoV-2 sequences available = 41,525) [36]; the same �80% coverage, �100X read-depth inclusion threshold was applied. Results summarized in S1 Fig confirmed that all SRA files from the 110 external infections showed evidence of multiple biallelic iSNVs that would reflect the complexity of the SARS-CoV-2 infection (nucleotide positions characterized by biallelic iSNVs in our VA study population and the samples queried in the SRA are designated by a green dot in Figs 2 and S1). Similar to observations from the VA Northeast Ohio Healthcare System patient population, 46 alternate biallelic iSNVs from a total of 87 variable positions did not clear the 50% consensus threshold. Among the 41 SNPs where AAFs did rise above the 50% threshold, the alternate allele would be reported in 668 occurrences and not reported in 312. Following the consensus sequence reporting convention across the 87 positions, 36% of the SARS-CoV-2 variations with AAFs between 5-50% (369 iSNVs) would not have been reported across the infections studied. Given that all SRA data queried showed this complexity, we asked whether attempts to report biallelic sequence variation were being made through the use of standard ambiguity codes [37] used to identify heterozygosity in diploid genetic systems. We first observed that the international data repositories understandably discourage the use of ambiguity codes given the substantial effort in attempting to curate sequence and determine lineage relationships as the pandemic continues to emerge. When we queried the database of 1,801,208 SARS-Cov-2 sequences accessible through Nextstrain/

Fig 2. Assessment of Alternate and Reference Allele proportions (AAF:RAF) Across 140 Infections.
Box and whisker plots illustrate the spread of RAF:AAF proportions for each mutation (median (bar); 25 th and 75 th percentiles (bottom and top of box, respectively); 10 th and 90 th percentiles (bottom and top whiskers, respectively)). The red dashed line at 50% identifies the demarcation at which the alternate allele is recognized and reported as the conventional consensus nucleotide that characterizes the infecting virus. These 75 mutations were chosen for further analysis because they all showed evidence of biallelic states at greater than the 5% background threshold and were shared in more than 10 infections (blue dashed line). Among these varying positions, Illumina sequence reads for 30 mutations did not reach an AAF:RAF >50% for any of the 140 infections (shaded gray in aligned table); 45 positions exceeded the RAF:AAF 50% threshold for a portion of the 140 infections and would therefore have contributed to characterization of the infecting viral strain. The accompanying table enumerates the number of samples with AAF between 5 and 95%, <50% and >50%. Overall, among the 5,390 alternate alleles detected at a frequency >5%, just 975 (18%) were observed at >50% frequency necessary to be reported in consensus sequence. Nucleotide positions characterized by biallelic iSNVs in our VA study population and the samples queried in the sequence read archive (SRA) (S1   October 3, 2021), while the repository reports extensive nucleotide substitutions, results showed that 85.9% reported no ambiguity sequence codes suggesting no biallelic polymorphism in their SARS-CoV-2 sequences; 0.46% reported >10 suggesting potential for a complex multi-strain infection (S6 Table). Given our presentation that most, if not all, SARS-CoV-2 full genome sequences are characterized by the occurrence of multiple biallelic positions, we have integrated appropriate ambiguity codes at positions where the AAF is >5% and observed that sequences are characterized by a minimum of 12 and maximum of 132 biallelic ambiguities. In head-to-head comparisons, nucleotide-specific substitution was higher in Nextstrain/GISAID compared to our study (average substitution rate = 1.69, st. dev. = 0.52), while biallelic ambiguity code usage is 100-fold lower (average substitution rate = 0.0097, st. dev. = 0.018); overall identification of mutations and ambiguity usage is summarized in S5C to S5F Table. Due to the increasing levels of SNVs observed over the time of sample collection, we report the average, minimum and maximum numbers of biallelic positions in quartiles (April 1 to August 1, 2020; September 28 to December 9, 2020; December 10, 2020 to April 13, 2021; April 15 to August 18, 2021), corresponding to breaks in sample availability ( Table 1).

Epidemiological transitions of SARS-CoV-2 infection complexity
In addition to the accumulation of increased polymorphism across the viral genome over time, closer examination of the genomic variations in the summary zebra plot (so-called due to the gray-scale patterns; Fig 3A) revealed significant pattern shifts in the sequences that characterize the major variants in our sample population that correspond to epidemiological transition periods. Following the sample collection time series ( In evaluation of their genomic sequence data, variants are observed with AAFs > 5% but not reaching major consensus, that are consistent with B.1.617.2-specific SNPs (Fig 3B). Following conventional practices, the major variants (with frequencies above 50% allele frequency cutoff) identified for these infections would have been reported as B.1.1.7 and not a mixed strain infection of B.1.1.7/B.1.617.2. Observation of these minor alleles above the 5% cutoff demonstrates early detection of the next major variant of concern (Fig 1F). A final summary of the distribution of SNVs observed in our study population is presented in Fig 4. The data show that variation is concentrated among the genes encoding structural (mean = 4.2%; range-1.3% to 7.7%) and accessory proteins (mean = 5.3% range-3.1% to 7.4%), compared to non-structural proteins (mean = 3.1% range-2.9% to 3.3%). This is particularly true for positions characterized by biallelic variation where AAF may not rise above 50% (Black components of stacked histogram bars); of which many are observed in the SARS-CoV-2 spike protein.

Bottleneck evaluations and transmission dynamics
Examination of the total infection sequence composition has also been considered in our study of infection transmission complexity from multiple exposure events. We performed bottleneck assessment using the exact beta-binomial method of Sobel-Leonard et al. [38], that recognizes AAF and RAF proportions instead of treating nucleotide sequence variation on a presence/absence basis taken by alternative models [39,40]. These analyses focused on 11 donor-recipient pairs, some that have been studied previously [33][34][35]. While donors were known from transmission events that occurred during commuter van transport events, we performed reciprocal calculations (A to B and B to A) for all transmission pairs. Results from our overall analyses show the maximum likelihood estimate (MLE) of the bottlenecks among our transmission pairs was between 6 to 123 virions (Fig 5A). These results include transmission events that occurred toward the end of pandemic Year-1 (December 2020 and January 2021) with a viral population characterized by greater genetic diversity compared to the first quarter of 2020 (Fig 1F). Paired SNP proportions in the "transmission plots" (Fig 5B-5D) show the AAF for each position in the SARS-CoV-2 genome between the respective infected donor (left) and subsequent recipient (right). In these graphs, most of the nucleotide-specific data are at or below the 5% alternate allele threshold and therefore would be designated the reference allele (>29,000 nucleotide positions). Results further indicate that among the majority of biallelic positions, the allele frequencies were transmitted in very similar proportions (AAF: RAF). As such, the average allele frequency differences (AFΔ) between donor and recipient  Table. https://doi.org/10.1371/journal.pgen.1010200.g004 pairs were minimal in all transmission scenarios, regardless of exposure details (0.09 to 0.17%; Fig 5B to 5D).

Association between immunological competence and SARS-CoV-2 infection complexity
Finally, given published reports of extensive SARS-CoV-2 sequence diversity in association with infected immunocompromised (IM-) individuals [41][42][43][44], we have evaluated the complexity of SARS-CoV-2 sequence in the context of the immunological conditions of the infected individuals in our study. Because the infections studied included asymptomatic individuals and those who receive primary care outside the VA Northeast Ohio Healthcare System, it was not possible to conduct a complete association study on this focus and represents a limitation of our study. Additionally, we were not able to observe emergence of increasing numbers of SARS-CoV-2 mutations that have been described in longitudinal studies. Nevertheless, SARS-CoV-2 sequence meeting our analysis inclusion threshold was possible for 16 individuals with a range of clinical characteristics consistent with immunodeficiencies (including those classified as mild to severe by the CDC-see Methods, Patient Clinical Characteristics and Outcomes). Given the observations shown in Fig 1F,

Discussion
Our study has revealed significant within host infection diversity of SARS-CoV-2 through full genome sequence analysis. We greatly appreciate the work of international data repositories such as GISAID and NextStrain for curating the millions of reported infections and believe that these repositories are a tremendous and invaluable asset for the monitoring of SARS-CoV-2. The shear amount of data that needs to be collected and annotated makes it understandable and perhaps necessary to represent these infections by a single consensus sequence. While the practice of reporting a single unambiguous sequence assists with tracking SARS-CoV-2 in regard to evolution and global distribution of lineages, emphasis on reporting the majority nucleotide signature significantly reduces our ability to detect emerging new variants that could ignite a new surge with public health impact. Additionally, the majority consensus approach limits our ability to understand the complexity of infections, dynamic changes in the viral population over time within infected individuals, and between donor and recipient individuals across transmission events.
In this study, we addressed the majority consensus problem as it relates to SARS-CoV-2 infection through the use of standard ambiguity codes [37]. All consensus sequences reported to GenBank from this study have included ambiguity codes. Taking this approach, our study Transmission settings include a VA Clinical Ward b; a VA Emergency Department c; two independent commuter transport vans d. Inset date for each transmission graph show the maximum change in allele frequency (Max AFD), the median allele frequency change (Med AFD), and the average allele frequency change (Avg AFD). Allele sharing assessments were also performed for two randomly chosen pairs who did not work together or have known contact-  [46]. Confirming this potential will require consideration of minor alternate alleles as suggested here in more extensive whole genome analysis of SARS-CoV-2.
In our assessment of transmission from infected donors to recipients we were also interested to see that the overall sequence diversity was largely retained. This appeared to be true regardless of the AAF:RAF mixtures (Fig 5B to 5D). These transmission plots provide both quantitative and qualitative perspectives on SARS-CoV-2 complexity. Many of the AAF:RAF mixtures were observed to be present in donor and recipient sequences at nearly identical ratios. For the events captured in commuter vans (two independent events) where the direction of transmission was known to be from infected drivers to uninfected passengers, this suggests that the majority of SARS-CoV-2 strain diversity is included during transmission and subsequent infection complexity [35]. This appeared to be true even though transmission occurred within the context of circulating air. Maximum likelihood estimates of the bottlenecks among our transmission pairs (van driver to passengers as well as others Fig 5A) were similar or higher than those observed by Braun [32] and Lythgoe [31], but did not reach the highest values reported by Popa [29].
Finally, given reports that have suggested evolution of complexed populations of SARS-CoV-2 in immunocompromised people [41][42][43][44], we were interested to compare infections between IM+ and IM-individuals. While normalizing for overall complexity by comparing infections that occurred within time-similar cohorts, we did not observe increased numbers of iSNVs in IM-patients. This appears to differ from the observations of Al Khatib et al. who reported increased numbers of biallelic subconsensus variants (indicating AAF:RAF mixtures) in individuals classified as severe cases compared to individuals with mild cases [27] with comorbidities similar to the IM-patients studied here. As we indicated previously, we acknowledge limitations in this part of our study. We were not able to study single patients over extended time series known to have extended for up to one year by other investigators [43]. Therefore, we did not have the opportunity to compare potential evolution of SARS--CoV-2 variants within individual IM+ and IM-patients over time. Our conclusions are limited by the fact that we studied a small sample size and were unable to control for potential confounding factors such as duration of infection, genetic differences, and COVID-19 treatment.
The continuing dynamics of the COVID-19 pandemic have become increasingly complex and unpredictable. Underlying the challenge of COVID-19 transmission among humans, Sender et al. estimate that every infected person carries 1 billion to 100 billion virions during peak infection [47]. As a result, despite proof-reading activity within the SARS-CoV-2 RdRp that would limit mutation, efficient virus replication and transmission increasingly favors dispersal of mutations across abundant globally distributed infections (>460,000,000 to-date [19,20]). As levels of immunity stimulated by infection and vaccination fluctuate in the human population ((a) no infection or vaccine exposure; (b) previous infection(s) and no vaccination; (c) no infections and vaccination(s); (d) previous infection(s) and vaccination(s)), the global SARS-CoV-2 population will encounter significant heterogeneity in acquired and natural human host defense mechanisms. Given the global disparities influencing access to, and uptake of vaccines, as well as availability and practice of non-pharmaceutical interventions, the virus has a free range for making thousands of deleterious as well as advantageous mutations. Therefore, significant opportunity exists for random chance to determine what future chapters of the pandemic will present.
This emphasizes the need to study genetic complexity of SARS-CoV-2 infections more intensively and requires increased examination to achieve a greater understanding of the capacity of this virus to evade our best efforts at mitigation. While there are efforts to sequence the virus and document how it is evolving, most of the scientific community recognizes that this effort needs to be increased [48][49][50][51][52][53]. Coupled with this effort, approaches to report characteristics of multi-strain complex infections must be considered to enable closer monitoring and full understanding of the virus's evolutionary capacity.
Lest it be thought that under-reporting of the genetic diversity of infections occurs only with SARS-CoV-2 genome sequence data, consensus sequence reporting is the submission practice for most (if not all) infectious disease pathogens to national and international data repositories (https://VEuPathDB.org) from the US National Institute of Allergy and Infectious Diseases (NIAID) and the Wellcome Trust (example pathogens and vectors include biosecurity pathogens (e.g. Yersinia pestis (plague), Marburg virus, Ebola virus; pathogens of global concern [e.g. Plasmodium, Mycobacterium tuberculosis). Adequately addressing the underreporting of infection complexity represented in data repositories has significant potential to alter vitally important characteristics of infectious diseases, including assessment of drug treatment efficacy and resistance and vaccine escape.

Ethics statement
The study protocol was approved by the Cleveland VA Medical Center's Institutional Review Board (IRB protocol number 1584025) with a waiver of consent. This same Institutional Review Board approved the waiver of consent for this study.

Patient enrollment and informed consent
As part of efforts to identify, provide care for infected individuals and limit the spread of COVID-19 within the VA Northeast Ohio Healthcare System, the Infection Control Department collected nasopharyngeal swab specimens from all individuals showing symptoms of a respiratory infection, all patients admitted to the hospital or undergoing invasive medical procedures, and asymptomatic close contacts of COVID-19 cases starting in March 2020 in accordance with Centers for Disease Control and Prevention (CDC) recommendations [54]. Since this time, we have analyzed over 300 samples from a variety of sources to investigate SARS--CoV-2 whole genome sequence variation.

Patient clinical characteristics and outcomes
Nasopharyngeal swab specimens were collected from patients and healthcare personnel. Clinical information was not available for healthcare personnel. For the patients, medical record review was performed to obtain information on age, medical conditions, medications, and severity of illness. COVID-19 was considered severe if oxygen saturation was <94% on room air or respiratory rate was >30 breaths/min or lung infiltrates >50% were present (https:// www.covid19treatmentguidelines.nih.gov/overview/clinical-spectrum/). Patients were considered to be moderately or severely immunocompromised if they had conditions causing immunocompromise as defined by the Centers for Disease Control and Prevention (eg, receiving active cancer treatment for tumors or cancers of the blood, received an organ or stem cell transplant and taking medicine to suppress the immune system, advanced or untreated HIV infection, active treatment with high-dose corticosteroids or other drugs that suppress their immune response) (https://www.cdc.gov/coronavirus/2019-ncov/vaccines/recommendations/ immuno.html).

Patient sampling and RNA extraction
All nasopharyngeal swabs were immersed in 2 mL of universal transport medium (Copan Diagnostics, Murrieta, CA) and stored at -80˚C. RNA was extracted from 140ul the stored sample to yield approximately 100ul of purified RNA following Qiagen protocols (either QIAamp Viral RNA Mini Kit or the QIAcube HT, automated 96-well plate format).

Real-time PCR diagnosis of SARS-CoV-2 infection
Real-time PCR (RT-PCR) amplification of RNA was performed following the Luna Universal One-Step RT-qPCR kit (New England Biolabs). A CFX 96 Touch Real-Time PCR Thermocycler/ Detection System (Bio-Rad) was used to quantify the amount of viral RNA present in each sample. The primer/probe set was designed as part of the Food and Drug Administration, Emergency Use Authorization protocol [55] was followed to amplify the SARS-CoV-2 nucleocapsid gene target sequence (Forward primer (2019-nCoV_N1-F) 5'-GACCCCAAAATCAGC GAAAT-3'; Reverse primer (2019-nCoV_N1-R) 5'-TCTGGTTACTGCCAGTTGAATCTG -3'; Probe (2019-nCoV_N1-P) design FAM-ACCCCGCATTACGTTTGGTGGACC-BHQ1). Temperature cycling conditions for the assay were as follows: 1 cycle of 55˚C for 10 minutes (reverse transcription), 1 cycle of 95˚C for 1 minute (initial denaturation), 45 cycles of 10 seconds at 95˚C (denaturation) and 30 seconds at 60˚C (extension). An in-plate fluorescence read was performed after each cycle of the 45-cycle amplification protocol. Melt curve analysis was added for verification at the end cycle. Standard curve calibration using known quantities of a SARS-CoV-2 RNA template (Exact Diagnostics) were performed for each new primer/probe lot used in this assay. This standard contains RNA transcripts for the N gene of SARS-CoV-2. Additional controls for E, ORF1ab, RdRp, and S genes were available, but stability of N gene amplification did not prompt varying RT-PCR assessments. The target gene transcript was quantitated to 200,000 copies/mL using digital droplet PCR. Each RT-qPCR run contained positive and no template controls, in addition to the negative control from the extraction process.

Whole-genome sequencing and data analysis
Whole-genome sequencing of SARS-CoV-2 was performed on 254 clinical samples from COVID-19 patients following library preparation reagents and protocols developed by Paragon Genomics (dual indexed CleanPlex SARS-CoV-2 Panel 918010 or 918011) [56]. Sequence was generated using an Illumina MiSeq sequencing platform. Raw sequences were aligned to COVID-19 reference sequence NC_045512.2. Primer sequences were trimmed from the ends of aligned reads using a masking file provided by Paragon Genomics using the software package fgbio. Samples were included in the final analysis if at least 80% of the genome was covered at > 100X read depth. Variants were called at positions with a minimum coverage of 100 reads with a base quality score >20. Positions were considered variable if any sample showed an alternate allele frequency at that position greater than 5%. These inclusion criteria are consistent and, in many ways, exceed benchmarks included in recently published quality assessments to evaluate efforts to sequence and describe SARS-CoV-2 genome sequence variation [57,58].
We expanded our analyses to determine whether our observation of frequent biallelic iSNVs were characteristic of SARS-CoV-2 sequences submitted to international repositories. For this analysis, we performed convenience sampling to select 110 additional COVID-19 infections reported from 18 different laboratories or consortia with raw sequence data available in the sequence read archive (SRA) (S4 Table). When supplied, further information provides more specific details on the laboratory or clinic that submitted the patient sample. Inclusion of SARS-CoV-2 full genome sequence for the samples selected from the SRA were required to meet the same sequence breadth and depth criteria established above; bioinformatic analyses were performed using the same approaches.
Consensus sequences were generated from this data by following conventional approaches, reporting the major nucleotide represented at each nucleotide position. Therefore, variant nucleotides in relation to NC_045512.2 were reported for any genomic position at which an alternate allele was detected at >50% in a sequenced sample. Additionally, sequences with ambiguity codes were also generated for allele frequencies between 5-95% using iVar Consensus (v.1.3) setting the minimum threshold t = 0.95. Thus, in contrast to reporting sequence variation as limited to the standard nucleotide states (A-adenine; G-guanine; C-cytosine; Uuracil/T-thymine) iVar reports standard ambiguity codes to identify varying positions in the SARS-CoV-2 genome where more than one nucleotide state was observed at a frequency > 5% at a specific genome position (e.g. R = observation of A and G [37]; see Supplement). These consensus sequences inclusive of biallelic ambiguity codes have been reported to GenBank (NCBI-GenBank Accession numbers OM988245-OM988384; S3 Table) and GISAID. Additionally, raw sequencing data have been submitted to the SRA.
Evaluation of ambiguity code usage of sequence data available through Nextstrain was performed using awk (a command line tool for text-based pattern searching) on metadata available for 1,801,208 SARS-Cov-2 sequences accessible through Nextstrain up to October 3, 2021; data accessed on October 10, 2021.
Supporting information S1   S4  Table regarding specific details on the laboratory or clinic that submitted the patient sample. Box and whisker plots illustrate the distribution of alternate allele frequencies for each mutation (median (bar); 25th and 75th percentiles (bottom and top of box, respectively); 10th and 90th percentiles (bottom and top whiskers, respectively)). The red dashed line at 50% identifies the demarcation at which the alternate allele is recognized and reported as the major consensus nucleotide that characterizes the infecting virus. These 87 mutations were chosen for further analysis because they all showed evidence of biallelic states at greater than the 5% background threshold and were observed in more than 2 infections (blue dashed line). Among these varying positions, 46 mutations did not reach an alternate allele frequency >50% for any of the 110 infections; 41 positions exceeded the 50% threshold for a portion of the 110 infections and would therefore have contributed to characterization of the infecting viral strain. The accompanying table enumerates the number of samples with AAF between 5 and 95%, <50% and >50%. Overall, among the 1,080 alternate alleles detected at a frequency >5%, just 657 (60.8%) were observed at an alternate allele frequency >50% necessary to be reported in consensus sequence. Nucleotide positions characterized by biallelic iSNVs in our VA study population (Fig 2) and the samples queried in the SAR are designated by a green dot. (TIF)

S2 Fig. Comparison of SNVs between Immunocompetent (IM+) and
Immunocompromised (IM-) Patients. IM-individuals are identified by red circles. Comparisons suggest that there is no difference in the SNVs observed at either threshold. Correlation between the number of alternate alleles identified in each sample and the sample collection date-AAF >5% (black circles), or AAF >50% (blue diamonds); samples were the 140 that were >80% 100x. Significant correlation was observed between an increase in AAFs during the sample collection time period (AAF >5%, Pearson R = 0.76112; AAF >50%, Pearson R = 0.89253; both P-values <0.0001). (TIF)